From "Papalexi, E., and Satija, R. (2018). Single-cell RNA sequencing to explore immune cell heterogeneity. Nature Reviews Immunology 18, 35–45." :
Single-cell RNA sequencing (scRNA-seq) technologies use many different methods for cell isolation and transcript amplification. Whereas some technologies capture cells using microfluidic devices that trap cells inside hydrogel droplets, other technologies rely on methods (such as fluorescence-activated cell sorting (FACS) into 96‐well plates and the microfluidic chips used by Fluidigm C1) that physically separate one cell from another in wells. Once cells are lysed, reverse transcription and PCR amplification are carried out. Droplet-based approaches, and some plate-based approaches, allow for pooled PCR amplification using cellular barcoding techniques, which decreases the cost as only one PCR reaction is required per experiment or plate. In other plate-based approaches and for Fluidigm C1, the number of PCR amplification reactions is equal to the number of cells that are being profiled, which makes these approaches expensive. PCR products are further processed to prepare samples for sequencing. Some approaches that use sequencing of the 3ʹ end of each transcript allow for quantification of expression of each gene within a cell. Other approaches, however, can sequence full-length transcripts, which allows not only for detection of gene expression but also for analysis of splicing variants and B cell receptor (BCR) or T cell receptor (TCR) repertoire diversity. InDrop, indexing droplets sequencing; MARS-seq, massively parallel single-cell RNA sequencing.
From "Ziegenhain, C., Vieth, B., Parekh, S., Reinius, B., Guillaumet-Adkins, A., Smets, M., Leonhardt, H., Heyn, H., Hellmann, I., and Enard, W. (2017). Comparative Analysis of Single-Cell RNA Sequencing Methods. Molecular Cell 65, 631-643.e4."" :
From McCaroll lab website :
Millions of paired-end reads are generated from a Drop-seq library on a high-throughput sequencer. The reads are first aligned to a reference genome to identify the gene-of-origin of the cDNA. Next, reads are organized by their cell barcodes, and individual UMIs are counted for each gene in each cell. The result, shown at far right, is a "digital expression matrix" in which each column corresponds to a cell, each row corresponds to a gene, and each entry is the integer number of transcripts detected from that gene, in that cell.
10X genomics datasets hgmm_100
100 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells Single Cell Gene Expression Dataset by Cell Ranger 2.1.0
1:1 mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells. This is a classic human-mouse mixture experiment to demonstrate single cell behavior.
100 cells detected Sequenced on Illumina Hiseq4000 with approximately 70,000 reads per cell 26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode Analysis run with --expect-cells=100 Analysis run with reference version 1.2.0
Published on November 8, 2017
This dataset is licensed under the Creative Commons Attribution license.
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/hgmm_100/hgmm_100_fastqs.tar
qlogin -pe thread 8 -q formation.q
cdprojet
mkdir Mapping_QC_Quantification
cd Mapping_QC_Quantification
ln -s /projet/sbr/sincellte/Mapping_QC_Quantification/Data .
ln -s /projet/sbr/sincellte/Mapping_QC_Quantification/Tools .
ln -s /projet/sbr/sincellte/Mapping_QC_Quantification/run_dropseqtools_only.bash .
Expected SampleSheet format :
From Illumina website :
cat(readLines('html_files/NovaSeq_SampleSheet.csv'), sep = '\n')
[Header]
IEMFileVersion,5
Investigator Name,MD
Experiment Name,sincellte
Date,31/05/2018
Workflow,GenerateFASTQ
Application,NovaSeq FASTQ Only
Instrument Type,NovaSeq
Assay,Chromium SingleCell 10x
Index Adapters,"Chromium SingleCell 10x Indexes (4x96 Indexes)"
Description,PE26-91_SingleCell-10X
Chemistry,Default
[Reads]
26
91
[Settings]
[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,Sample_Project,Description
s1_1,test_sample,,,SI-GA-A9_1,TCTTAAAG,sincellte,Homo_sapiens
s1_2,test_sample,,,SI-GA-A9_2,CGAGGCTC,sincellte,Homo_sapiens
s1_3,test_sample,,,SI-GA-A9_3,GTCCTTCT,sincellte,Homo_sapiens
s1_4,test_sample,,,SI-GA-A9_4,AAGACGGA,sincellte,Homo_sapiens
Here, SI-GA-A9 refers to a 10x sample index, a set of four oligo sequences (TCTTAAAG, CGAGGCTC, GTCCTTCT, AAGACGGA)
10X Sample Index Sets for Single Cell 3' website
export PATH=/opt/sge/bin/lx24-amd64/:$PATH
export SGE_CLUSTER_NAME=SGE
source /usr/local/genome2/cellranger/sourceme.bash
/usr/local/genome2/bin/cellranger mkfastq --run=${FLOWCELL_DIR} --samplesheet=${SAMPLE_SHEET_PATH} --jobmode=sge --mempercore=8 --output-dir=${OUTPUT_DIR}
Bcl2fastq manual from Illumina website
bcl2fastq --use-bases-mask=Y26,I8,Y98 \
--create-fastq-for-index-reads \
--minimum-trimmed-read-length=8 \
--mask-short-adapter-reads=8 \
--ignore-missing-positions \
--ignore-missing-filter \
--ignore-missing-bcls \
-r 6 -w 6 \
-R ${FLOWCELL_DIR} \
--output-dir=${OUTPUT_DIR} \
--interop-dir=${INTEROP_DIR} \
--sample-sheet=${SAMPLE_SHEET_PATH}
### Real time from 2h to 6h ###
Parameters list :
--use-bases-mask : The --use-bases-mask string specifies how to use each cycle.
• An n means ignore the cycle.
• A Y (or y) means use the cycle.
• An I means use the cycle for the Index Read.
• A number means that the previous character is repeated that many times. An asterisk [*] means that the previous character is repeated until the end of this read or index (length according to the RunInfo.xml).
The read masks are separated with commas: ,
The format for dual indexing is as follows: --use-bases-mask Y*,I*,I*,Y* or variations thereof as specified.
You can also specify the --use-bases-mask multiple times for separate lanes, like this way: --use-bases-mask 1:y*,i*,i*,y* --use-bases-mask y*,n*,n*,y*
Where the 1: means: Use this setting for lane 1. In this case, the second --use-bases-mask parameter is used for all other lanes.
If this option is not specified, the mask is determined from the 'RunInfo.xml file in the run directory. If it cannot do this determination, supply the --use-bases-mask.
When the --use-bases-mask option is specified, the number of index cycles and the length of index in the sample sheet should match.
--create-fastq-for-index-reads : Create FASTQ files also for Index Reads. Generating FASTQ files is based on the following:
• The index read masks are specified from the --use-bases-mask option.
• The RunInfo.xml file when the --use-bases-mask option is not used.
--minimum-trimmed-read-length : Minimum read length after adapter trimming. bcl2fastq trims the adapter from the read down to the value of this parameter. If there is more adapter match below this value, then those bases are masked, not trimmed (replaced by N rather than removed).
Default: 35
--mask-short-adapter-reads : This option applies when a read is shorter than the length specified by --minimum-trimmed-read-length (note that the read does not specifically have to be trimmed for this option to trigger, it need only fall below the —minimum-trimmed-read-length for any reason). These parameters specify the following behavior:
• If the number of bases left after adapter trimming is less than --minimum-trimmed-read-length, force the read length to be equal to --minimum-trimmed-read-length by masking adapter bases (replace with Ns) that fall below this length.
• If the number of ACGT bases left after this process falls below --mask-short-adapter-reads, mask all bases, resulting in a read with --minimum-trimmed-read-length number of Ns. In addition, if a read is shorter than--mask-short-adapter-reads for any reason, it will be masked with Ns. Because it applies when a read is shorter than the value of --minimum-trimmed-read-length, it should be set to a value that is less than or equal to this parameter.
• If it is set to a greater value, it will automatically default to the same value as --minimum-trimmed-read-length.
Default: 22
--ignore-missing-positions : Missing or corrupt positions files are ignored. If corresponding position files are missing, bcl2fastq writes unique coordinate positions in FASTQ header.
--ignore-missing-filter : Missing or corrupt filter files are ignored. Assumes Passing Filter for all clusters in tiles where filter files are missing.
--ignore-missing-bcls : Missing or corrupt BCL files are ignored. Assumes 'N'/'#' for missing calls
-r, --loading-threads : Number of threads used for loading BCL data.
Default: 4
-w, --writing-threads : Number of threads used for writing FASTQ data. This number should not be set higher than number of samples.
Default: 4
-R, --runfolder-dir : Path to run folder directory
Default: ./
-o, --output-dir : Path to demultiplexed output
Default: <runfolder-dir>/Data/Intensities/BaseCalls/
--interop-dir : Path to demultiplexing statistics directory
Default: <runfolder-dir>/InterOp/
--sample-sheet : Path to sample sheet, so you can specify the location and name of the sample sheet, if different from default.
Default: <runfolder-dir>/SampleSheet.csv
From FASTQ to count table :
From count table to post-analysis :
mkdir -p Results/cellranger_local_expected_cells_100_nosecondary
cd Results/cellranger_local_expected_cells_100_nosecondary
source /usr/local/genome2/cellranger/sourceme.bash
/usr/local/genome2/bin/cellranger count --id=count_hgmm_100_hg19_mm10 --transcriptome=/db/off_biomaj/cellranger/refdata-cellranger-hg19_and_mm10-2.1.0 --fastqs=${ROOT_FOLDER}/Data/fastqs/original --sample=hgmm_100 --jobmode=local --localcores=8 --localmem=50 --expect-cells=100 --nosecondary
### Real time 4h40min with secondary analysis ###
### Real time 1h34min without secondary analysis ###
Parameters list :
--id : A unique run ID string: e.g. sample345
--transcriptome : Path to the Cell Ranger compatible transcriptome reference e.g.
• For a human-only sample : [GRCh38 v1.2.0](http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-GRCh38-1.2.0.tar.gz)
• For a mouse-only sample : [mm10 v2.1.0](http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-2.1.0.tar.gz) or [mm10 v1.2.0](http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-mm10-1.2.0.tar.gz)
• For a human and mouse mixture sample : [hg19/mm10 v2.1.0](http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-hg19-and-mm10-2.1.0.tar.gz)
• For sample with ERCC spike-ins : [92 ERCC v1.2.0](http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-ercc92-1.2.0.tar.gz)
--fastqs :
• Path of the fastq_path folder generated by cellranger mkfastq. This contains a directory hierarchy that cellranger count will automatically traverse.
or
• Any folder containing fastq files, for example if the fastq files were generated by a service provider and delivered outside the context of the mkfastq output directory structure. Can take multiple comma-separated paths, which is helpful if the same library was sequenced on multiple flowcells. Doing this will treat all reads from the library, across flowcells, as one sample. If you have multiple libraries for the sample, you will need to run cellranger count on them individually, and then combine them with cellranger aggr.
--sample : Sample name as specified in the sample sheet supplied to cellranger mkfastq. Can take multiple comma-separated values, which is helpful if the same library was sequenced on multiple flowcells and the sample name used (and therefore fastq file prefix) is not identical between them. Doing this will treat all reads from the library, across flowcells, as one sample. If you have multiple libraries for the sample, you will need to run cellranger count on them individually, and then combine them with cellranger aggr. Allowable characters in sample names are letters, numbers, hyphens, and underscores.
--jobmode : Job manager to use (local, sge, lsf or a .template file)
Default: local
--expect-cells : (optional) Expected number of recovered cells.
Default: 3,000 cells.
--nosecondary : (optional) Add this flag to skip secondary analysis of the gene-barcode matrix (dimensionality reduction, clustering and visualization). Set this if you plan to use cellranger reanalyze or your own custom analysis.
- web_summary.html : Run summary metrics and charts in HTML format - metrics_summary.csv : Run summary metrics in CSV format
## [,1]
## Estimated.Number.of.Cells "100"
## Mean.Reads.per.Cell "71,976"
## Number.of.Reads "7,197,662"
## Valid.Barcodes "98.5%"
## Sequencing.Saturation "29.2%"
## Q30.Bases.in.Barcode "97.8%"
## Q30.Bases.in.RNA.Read "80.7%"
## Q30.Bases.in.Sample.Index "95.5%"
## Q30.Bases.in.UMI "97.9%"
## Reads.Mapped.to.Genome "95.0%"
## hg19.Reads.Mapped.to.Genome "59.3%"
## mm10.Reads.Mapped.to.Genome "35.9%"
## Reads.Mapped.Confidently.to.Genome "91.0%"
## hg19.Reads.Mapped.Confidently.to.Genome "57.5%"
## mm10.Reads.Mapped.Confidently.to.Genome "33.5%"
## Reads.Mapped.Confidently.to.Intergenic.Regions "2.8%"
## hg19.Reads.Mapped.Confidently.to.Intergenic.Regions "1.8%"
## mm10.Reads.Mapped.Confidently.to.Intergenic.Regions "1.1%"
## Reads.Mapped.Confidently.to.Intronic.Regions "9.6%"
## hg19.Reads.Mapped.Confidently.to.Intronic.Regions "6.7%"
## mm10.Reads.Mapped.Confidently.to.Intronic.Regions "3.0%"
## Reads.Mapped.Confidently.to.Exonic.Regions "78.5%"
## hg19.Reads.Mapped.Confidently.to.Exonic.Regions "49.1%"
## mm10.Reads.Mapped.Confidently.to.Exonic.Regions "29.4%"
## Reads.Mapped.Confidently.to.Transcriptome "74.9%"
## hg19.Reads.Mapped.Confidently.to.Transcriptome "46.6%"
## mm10.Reads.Mapped.Confidently.to.Transcriptome "28.3%"
## Reads.Mapped.Antisense.to.Gene "0.9%"
## hg19.Estimated.Number.of.Cell.Partitions "50"
## mm10.Estimated.Number.of.Cell.Partitions "50"
## Fraction.Reads.in.Cells "85.0%"
## hg19.Fraction.Reads.in.Cells "83.7%"
## mm10.Fraction.Reads.in.Cells "86.7%"
## hg19.Median.Genes.per.Cell "5,475"
## mm10.Median.Genes.per.Cell "3,965"
## hg19.Total.Genes.Detected "14,030"
## mm10.Total.Genes.Detected "11,559"
## hg19.Median.UMI.Counts.per.Cell "38,027"
## mm10.Median.UMI.Counts.per.Cell "23,130"
## GEMs.with..0.Cells "100"
- possorted_genome_bam.bam : Reads aligned to the genome and transcriptome annotated with barcode information - possorted_genome_bam.bam.bai : Index for possorted_genome_bam.bam
- filtered_gene_bc_matrices : Filtered gene-barcode matrices containing only cellular barcodes in MEX format (input folder for downstream analysis packages such as Seurat containing 3 files : "barcodes.tsv", "genes.tsv" and "matrix.mtx")
- filtered_gene_bc_matrices_h5.h5 : Filtered gene-barcode matrices containing only cellular barcodes in HDF5 format
- raw_gene_bc_matrices : Unfiltered gene-barcode matrices containing all barcodes in MEX format
- raw_gene_bc_matrices_h5.h5 : Unfiltered gene-barcode matrices containing all barcodes in HDF5 format
- molecule_info.h5 : Molecule-level information used by cellranger aggr to aggregate samples into larger datasets.
From 10X Single-cell community forum :
Cellranger count performs some basic quality filtering and error correction for UMI sequencing errors.
The following criteria are used for UMI filtering:
UMIs that are 1 Hamming distance (substitution) away from a higher-count UMI are corrected to the higher count UMI if they share a cell barcode.
With the 10x workflow, transcripts are tagged with a 10bp barcode called UMI that is unique for a mRNA molecule in a cell. Therefore reads that have a common UMI are counted only once which enables accurate quantitation of a cell expression levels.
In CellRanger, the UMI counts are normalized before secondary analysis as follows:
Before PCA, the following is done to normalize UMI counts:
The PCA transformed values are then used to generate tSNE plots clustering.
There is some normalization built into the differential expression algorithm (sSeq) used by CellRanger. It's basically a scaling factor related to the size of the library.
The ERCC ExFold RNA Spike-In Mixes provide a set of external RNA controls that enable performance assessment of a variety of technology platforms used for gene expression experiments. Key product features:
If your sample contain ERCC spike-ins, DO NOT FORGET to add ERCC sequences in your genome/transcriptome FASTA AND GTF files in order to be present in your count table.
head -n 5 html_files/ERCC92.fa
head -n 5 html_files/ERCC92.gtf
## >ERCC-00002
## TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGGGCGTCG
## GCATCCAGACCGTCGGCTGATCGTGGTTTTACTAGGCTAGACTAGCGTAC
## GAGCACTATGGTCAGTAATTCCTGGAGGAATAGGTACCAAGAAAAAAACG
## AACCTTTGGGTTCCAGAGCTGTACGGTCGCACTGAACTCGGATAGGTCTC
## ERCC-00002 ERCC exon 1 1061 0.000000 + . gene_id "ERCC-00002"; transcript_id "DQ459430";
## ERCC-00003 ERCC exon 1 1023 0.000000 + . gene_id "ERCC-00003"; transcript_id "DQ516784";
## ERCC-00004 ERCC exon 1 523 0.000000 + . gene_id "ERCC-00004"; transcript_id "DQ516752";
## ERCC-00009 ERCC exon 1 984 0.000000 + . gene_id "ERCC-00009"; transcript_id "DQ668364";
## ERCC-00012 ERCC exon 1 994 0.000000 + . gene_id "ERCC-00012"; transcript_id "DQ883670";
ROOT_FOLDER=`pwd`
source ${CONDA3}/activate sincellte2018
mkdir -p Results/dropseqtools/01_FastqToSam
java -jar /usr/local/genome2/picard-tools/FastqToSam.jar FASTQ=Data/hgmm_100_R1.fastq FASTQ2=Data/hgmm_100_R2.fastq QUALITY_FORMAT=Standard SAMPLE_NAME=hgmm_100 OUTPUT=Results/dropseqtools/01_FastqToSam/my_unaligned_data.bam
#5min
Parameters list :
QUALITY_FORMAT=FastqQualityFormat : A value describing how the quality values are encoded in the fastq. Either Solexa for pre-pipeline 1.3 style scores (solexa scaling + 66), Illumina for pipeline 1.3 and above (phred scaling + 64) or Standard for phred scaled scores with a character shift of 33. If this value is not specified, the quality format will be detected automatically. Default value: null. Possible values: {Solexa, Illumina, Standard}
OUTPUT=File : Output SAM/BAM file. Required.
READ_GROUP_NAME=String : Read group name Default value: A. This option can be set to 'null' to clear the default value.
SAMPLE_NAME=String : Sample name to insert into the read group header Required.
LIBRARY_NAME=String : The library name to place into the LB attribute in the read group header Default value: null.
PLATFORM_UNIT=String : The platform unit (often run_barcode.lane) to insert into the read group header Default value: null.
PLATFORM=String : The platform type (e.g. illumina, solid) to insert into the read group header Default value: null.
SEQUENCING_CENTER=String : The sequencing center from which the data originated Default value: null.
PREDICTED_INSERT_SIZE=Integer : Predicted median insert size, to insert into the read group header Default value: null.
COMMENT=String : Comment(s) to include in the merged output file's header. This option may be specified 0 or more times.
DESCRIPTION=String : Inserted into the read group header Default value: null.
RUN_DATE=Iso8601Date : Date the run was produced, to insert into the read group header Default value: null.
SORT_ORDER=SortOrder : The sort order for the output sam/bam file. Default value: queryname. This option can be set to 'null' to clear the default value. Possible values: {unsorted, queryname, coordinate}
MIN_Q=Integer : Minimum quality allowed in the input fastq. An exception will be thrown if a quality is less than this value. Default value: 0. This option can be set to 'null' to clear the default value.
MAX_Q=Integer : Maximum quality allowed in the input fastq. An exception will be thrown if a quality is greater than this value. Default value: 93. This option can be set to 'null' to clear the default value.
STRIP_UNPAIRED_MATE_NUMBER=Boolean : If true and this is an unpaired fastq any occurance of '/1' will be removed from the end of a read name. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ALLOW_AND_IGNORE_EMPTY_LINES=Boolean : Allow (and ignore) empty lines Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
From SAM specifications
:
This program extracts bases from the cell barcode encoding read (BARCODED_READ), and creates a new BAM tag with those bases on the genome read. By default, we use the BAM tag XC for cell barcodes, using the TAG_NAME parameter.
mkdir -p Results/dropseqtools/02_TagBamWithReadSequenceExtended_XC
Tools/Drop-seq_tools-1.13/TagBamWithReadSequenceExtended INPUT=Results/dropseqtools/01_FastqToSam/my_unaligned_data.bam OUTPUT=Results/dropseqtools/02_TagBamWithReadSequenceExtended_XC/unaligned_tagged_Cell.bam SUMMARY=Results/dropseqtools/02_TagBamWithReadSequenceExtended_XC/unaligned_tagged_Cellular.bam_summary.txt BASE_RANGE=1-16 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=False TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1
#7min
Parameters list :
SUMMARY=File : Summary of barcode base quality Default value: null.
BASE_RANGE=String : Base range to extract, seperated by a dash. IE: 1-4. Can extract multiple ranges by seperating them by a colon. For example 1-4:17-22 extracts the first 4 bases, then the 17-22 bases, and glues the sequence together into a single sequence for a tag. Required.
BARCODED_READ=Integer : The sequence can be from the first or second read [1/2]. Required.
TAG_BARCODED_READ=Boolean : Add the tag to the sequence the read came from? If false, the read that does not have the barcode gets the tag. If true, set the tag on the barcoded read. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
DISCARD_READ=Boolean : Discard the read the sequence came from?. If this is true, then the remaining read is marked as unpaired. If the read is unpaired, then you can't discard a read. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
HARD_CLIP_BASES=Boolean : Should the bases selected for the tag be hard clipped from the read? BE VERY CAREFUL WITH THIS FEATURE, FOR EXPERTS ONLY. NOT NEEDED FOR STANDARD DROPSEQ DATA PROCESSING.Don't use on aligned data, does NOT change cigar strings Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
BASE_QUALITY=Integer : Minimum base quality required for barcode Default value: 10. This option can be set to 'null' to clear the default value.
NUM_BASES_BELOW_QUALITY=Integer : Number of bases below minimum base quality to fail the barcode. Default value: 1. This option can be set to 'null' to clear the default value.
TAG_NAME=String : Barcode tag. This is typically X plus one more capitalized alpha. For example, 'XS', which is the default. Default value: XS. This option can be set to 'null' to clear the default value.
This program extracts bases from the cell barcode encoding read (BARCODED_READ), and creates a new BAM tag with those bases on the genome read. By default, we use the BAM tag XM for molecular barcodes, using the TAG_NAME parameter.
mkdir -p Results/dropseqtools/03_TagBamWithReadSequenceExtended_XM
Tools/Drop-seq_tools-1.13/TagBamWithReadSequenceExtended INPUT=Results/dropseqtools/02_TagBamWithReadSequenceExtended_XC/unaligned_tagged_Cell.bam OUTPUT=Results/dropseqtools/03_TagBamWithReadSequenceExtended_XM/unaligned_tagged_CellMolecular.bam SUMMARY=Results/dropseqtools/03_TagBamWithReadSequenceExtended_XM/unaligned_tagged_Molecular.bam_summary.txt BASE_RANGE=17-26 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=True TAG_NAME=XM NUM_BASES_BELOW_QUALITY=1
#6min
This program is used to remove reads where the cell or molecular barcode has low quality bases. During the run of TagBamWithReadSequenceExtended, an XQ tag is added to each read to represent the number of bases that have quality scores below the BASE_QUALITY threshold. These reads are then removed from the BAM.
mkdir -p Results/dropseqtools/04_FilterBAM
Tools/Drop-seq_tools-1.13/FilterBAM TAG_REJECT=XQ INPUT=Results/dropseqtools/03_TagBamWithReadSequenceExtended_XM/unaligned_tagged_CellMolecular.bam OUTPUT=Results/dropseqtools/04_FilterBAM/unaligned_tagged_filtered.bam
#6min
Parameters list :
MINIMUM_MAPPING_QUALITY=Integer : Minimum mapping quality to consider the read Default value: null.
FILTER_PCR_DUPES=Boolean : Should PCR duplicates be filtered? Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
RETAIN_ONLY_PRIMARY_READS=Boolean : Retain primary reads only Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
SUM_MATCHING_BASES=Integer : Retain reads that have at least this many M bases total in the cigar string. This sums all the M's in the cigar string. Default value: null.
REF_SOFT_MATCHED_RETAINED=String : Soft match reference names that have this string. If multiple matches are specified, they are OR'd together.This is the equivalent of a hard match with wrapped with .* on either side. Default value: null. This option may be specified 0 or more times.
REF_SOFT_MATCHED_REJECTED=String : Soft match and reject reference names that have this string. If multiple matches are specified, they are OR'd together. This is the equivalent of a hard match with wrapped with .* on either side. Default value: null. This option may be specified 0 or more times.
REF_HARD_MATCHED_RETAINED=String : Exact match reference names that have this string. If multiple matches are specified, they are OR'd together.For example, '1' would retain only references that were exactly '1'. This method accepts regular expressions. Default value: null. This option may be specified 0 or more times.
REF_HARD_MATCHED_REJECTED=String : Exact match and reject reference names that have this string. If multiple matches are specified, they are OR'd together.This method accepts regular expressions. Default value: null. This option may be specified 0 or more times.
TAG_RETAIN=String : Retain reads that have these tags set with any value. Can be set multiple times Default value: null. This option may be specified 0 or more times.
TAG_RETAIN_COMBINE_FLAG=String : If multiple TAG_RETAIN flags are set, should the result be the union of the filters, or the intersect? [UNION/INTERSECT]. Default value: null.
TAG_REJECT=String : Reject reads that have these tags set with any value. Can be set multiple times. Default value: null. This option may be specified 0 or more times.
TAG_REJECT_COMBINE_FLAG=String : If multiple TAG_REJECT flags are set, should the result be the union of the filters, or the intersect? [UNION/INTERSECT]. Default value: null.
STRIP_REF_PREFIX=String : Edit contig names so that a contig that starts with one of these prefixes has the prefix stripped. Default value: null. This option may be specified 0 or more times.
DROP_REJECTED_REF=Boolean : Edit sequence dictionary and remove any contig that has been filtered by reference name filtering. A read with mate alignment info in which mate is aligned to a contig that has been removed will be changed to have an unmapped mate. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
This program is one of two sequence cleanup programs designed to trim away any extra sequence that might have snuck it’s way into the reads. In this case, we trim the adapter that can occur 5’ of the read. In our standard run, we look for at least 5 contiguous bases (NUM_BASES) of the adapter (SEQUENCE) at the 5’ end of the read with no errors (MISMATCHES), and hard clip those bases off the read.
mkdir -p Results/dropseqtools/05_TrimStartingSequence
Tools/Drop-seq_tools-1.13/TrimStartingSequence INPUT=Results/dropseqtools/04_FilterBAM/unaligned_tagged_filtered.bam OUTPUT=Results/dropseqtools/05_TrimStartingSequence/unaligned_tagged_trimmed_smart.bam OUTPUT_SUMMARY=Results/dropseqtools/05_TrimStartingSequence/adapter_trimming_report.txt SEQUENCE=CTGTCTCTTATA MISMATCHES=0 NUM_BASES=5
#7min
Parameters list :
OUTPUT_SUMMARY=File : The output summary statistics Default value: null.
SEQUENCE=String : The sequence to look for at the start of reads. Required.
MISMATCHES=Integer : How many mismatches are acceptable in the sequence. Default value: 0. This option can be set to 'null' to clear the default value.
NUM_BASES=Integer : How many bases at the begining of the sequence must match before trimming occurs. Default value: 0. This option can be set to 'null' to clear the default value.
TRIM_TAG=String : The tag to set for trimmed reads. This tags the first base to keep in the read. 6 would mean to trim the first 5 bases. Default value: ZS. This option can be set to 'null' to clear the default value.
Standard Illumina adapter sequences can be fetched from Illumina adapter sequences PDF
e.g : AmpliSeq for Illumina Panels : "The following sequence is needed for adapter trimming : CTGTCTCTTATACACATCT"
This program is the second sequence cleanup program designed to trim away trailing polyA tails from reads. It searches for at least 6 (NUM_BASES) contiguous A’s in the read with 0 mismatches (MISMATCHES), and hard clips the read to remove these bases and all bases 3’ of the polyA run.
mkdir -p Results/dropseqtools/06_PolyATrimmer
Tools/Drop-seq_tools-1.13/PolyATrimmer INPUT=Results/dropseqtools/05_TrimStartingSequence/unaligned_tagged_trimmed_smart.bam OUTPUT=Results/dropseqtools/06_PolyATrimmer/unaligned_mc_tagged_polyA_filtered.bam OUTPUT_SUMMARY=Results/dropseqtools/06_PolyATrimmer/polyA_trimming_report.txt MISMATCHES=0 NUM_BASES=6
#6min
Parameters list :
OUTPUT_SUMMARY=File : The output summary statistics Default value: null.
USE_NEW_TRIMMER=Boolean : Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MISMATCHES=Integer : How many mismatches are acceptable in the sequence (old trim algo). Default value: 0. This option can be set to 'null' to clear the default value.
NUM_BASES=Integer : How many bases of polyA qualifies as a run of A's (old trim algo). Default value: 6. This option can be set to 'null' to clear the default value.
TRIM_TAG=String : The tag to set for trimmed reads. This tags the first base to exclude in the read. 37 would mean to retain the first 36 bases. Default value: ZP. This option can be set to 'null' to clear the default value.
ADAPTER=AdapterDescriptor : Symbolic & literal specification of adapter sequence. This is a combination of fixed bases to match, and references to SAMRecord tag values. E.g. '~XM^XCACGT' means 'RCed value of XM tag' + 'value of XC tag' + 'ACGT'. Ideally this is at least as long as the read (new trim algo) Default value: ^XM^XCACGTACTCTGCGTTGCTACCACTG. This option can be set to 'null' to clear the default value.
MAX_ADAPTER_ERROR_RATE=Double : Fraction of bases that can mismatch when looking for adapter match (new trim algo) Default value: 0.1. This option can be set to 'null' to clear the default value.
MIN_ADAPTER_MATCH=Integer : Minimum number of bases for adapter match (new trim algo) Default value: 4. This option can be set to 'null' to clear the default value.
MIN_POLY_A_LENGTH=Integer : Minimum length of a poly A run, except when start of end of read intervenes (new trim algo) Default value: 20. This option can be set to 'null' to clear the default value.
MIN_POLY_A_LENGTH_NO_ADAPTER_MATCH=Integer Minimum length of poly A run at end of read, if there is no adapter match (new trim algo) Default value: 6. This option can be set to 'null' to clear the default value.
DUBIOUS_ADAPTER_MATCH_LENGTH=Integer : If adapter match is at end of read, with fewer than this many bases matching the read, and not enough poly A is found preceding it, then ignore the adapter match and try again from the end of the read (new trim algo) Default value: 6. This option can be set to 'null' to clear the default value.
MAX_POLY_A_ERROR_RATE=Double : When looking for poly A, allow this fraction of bases not to be A (new trim algo)
Default value: 0.1. This option can be set to 'null' to clear the default value.
mkdir -p Results/dropseqtools/07_SamToFastq
java -jar /usr/local/genome2/picard-tools/SamToFastq.jar INPUT=Results/dropseqtools/06_PolyATrimmer/unaligned_mc_tagged_polyA_filtered.bam FASTQ=Results/dropseqtools/07_SamToFastq/unaligned_mc_tagged_polyA_filtered.fastq
#9min
Parameters list :
FASTQ=File : Output fastq file (single-end fastq or, if paired, first end of the pair fastq). Required. Cannot be used in conjuction with option(s) OUTPUT_PER_RG (OPRG)
SECOND_END_FASTQ=File : Output fastq file (if paired, second end of the pair fastq). Default value: null. Cannot be used in conjuction with option(s) OUTPUT_PER_RG (OPRG)
UNPAIRED_FASTQ=File : Output fastq file for unpaired reads; may only be provided in paired-fastq mode Default value: null. Cannot be used in conjuction with option(s) OUTPUT_PER_RG (OPRG)
OUTPUT_PER_RG=Boolean : Output a fastq file per read group (two fastq files per read group if the group is paired). Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false} Cannot be used in conjuction with option(s) SECOND_END_FASTQ (F2) UNPAIRED_FASTQ (FU) FASTQ (F)
OUTPUT_DIR=File : Directory in which to output the fastq file(s). Used only when OUTPUT_PER_RG is true. Default value: null.
RE_REVERSE=Boolean : Re-reverse bases and qualities of reads with negative strand flag set before writing them to fastq Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INTERLEAVE=Boolean : Will generate an interleaved fastq if paired, each line will have /1 or /2 to describe which end it came from Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_NON_PF_READS=Boolean : Include non-PF reads from the SAM file into the output FASTQ files. PF means 'passes filtering'. Reads whose 'not passing quality controls' flag is set are non-PF reads. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
CLIPPING_ATTRIBUTE=String : The attribute that stores the position at which the SAM record should be clipped Default value: null.
CLIPPING_ACTION=String : The action that should be taken with clipped reads: 'X' means the reads and qualities should be trimmed at the clipped position; 'N' means the bases should be changed to Ns in the clipped region; and any integer means that the base qualities should be set to that value in the clipped region. Default value: null.
READ1_TRIM=Integer : The number of bases to trim from the beginning of read 1. Default value: 0. This option can be set to 'null' to clear the default value.
READ1_MAX_BASES_TO_WRITE=Integer : The maximum number of bases to write from read 1 after trimming. If there are fewer than this many bases left after trimming, all will be written. If this value is null then all bases left after trimming will be written. Default value: null.
READ2_TRIM=Integer : The number of bases to trim from the beginning of read 2. Default value: 0. This option can be set to 'null' to clear the default value.
READ2_MAX_BASES_TO_WRITE=Integer : The maximum number of bases to write from read 2 after trimming. If there are fewer than this many bases left after trimming, all will be written. If this value is null then all bases left after trimming will be written. Default value: null.
INCLUDE_NON_PRIMARY_ALIGNMENTS=Boolean : If true, include non-primary alignments in the output. Support of non-primary alignments in SamToFastq is not comprehensive, so there may be exceptions if this is set to true and there are paired reads with non-primary alignments. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
After all these cleaning steps, we went from 7'197'662 reads in "hgmm_100_R2.fastq" to 7'161'989 reads in "unaligned_mc_tagged_polyA_filtered.fastq". So now, let's map them.
mkdir -p Results/dropseqtools/08_STAR
STAR --runThreadN 8 --genomeDir /db/off_biomaj/cellranger/refdata-cellranger-hg19_and_mm10-2.1.0/star --readFilesIn Results/dropseqtools/07_SamToFastq/unaligned_mc_tagged_polyA_filtered.fastq --outFileNamePrefix Results/dropseqtools/08_STAR/star --outSAMtype BAM SortedByCoordinate
#Coordinate sorted for visualization
#17min
Parameters list : see STAR Manual
mkdir -p Results/dropseqtools/09_SortSam
java -jar /usr/local/genome2/picard-tools/SortSam.jar I=Results/dropseqtools/08_STAR/starAligned.sortedByCoord.out.bam O=Results/dropseqtools/09_SortSam/aligned.sorted.bam SORT_ORDER=queryname
#5min
SORT_ORDER=SortOrder : Sort order of output file Required. Possible values: {unsorted, queryname, coordinate}
This program merges the sorted alignment output from STAR (ALIGNED_BAM) with the unaligned BAM that had been previously tagged with molecular/cell barcodes (UNMAPPED_BAM). This recovers the BAM tags that were “lost” during alignment. The REFERENCE_SEQUENCE argument refers to the fasta metadata file. We ignore secondary alignments, as we want only the best alignment from STAR (or another aligner), instead of assigning a single sequencing read to multiple locations on the genome.
mkdir -p Results/dropseqtools/10_MergeBamAlignment
java -jar /usr/local/genome2/picard-tools/MergeBamAlignment.jar REFERENCE_SEQUENCE=/db/off_biomaj/cellranger/refdata-cellranger-hg19_and_mm10-2.1.0/fasta/genome.fa UNMAPPED_BAM=Results/dropseqtools/06_PolyATrimmer/unaligned_mc_tagged_polyA_filtered.bam ALIGNED_BAM=Results/dropseqtools/09_SortSam/aligned.sorted.bam OUTPUT=Results/dropseqtools/10_MergeBamAlignment/merged.bam INCLUDE_SECONDARY_ALIGNMENTS=false PAIRED_RUN=false
#6min
UNMAPPED_BAM=File : Original SAM or BAM file of unmapped reads, which must be in queryname order. Required.
ALIGNED_BAM=File : SAM or BAM file(s) with alignment data. This option may be specified 0 or more times. Cannot be used in conjuction with option(s) READ1_ALIGNED_BAM (R1_ALIGNED) READ2_ALIGNED_BAM (R2_ALIGNED)
READ1_ALIGNED_BAM=File : SAM or BAM file(s) with alignment data from the first read of a pair. This option may be specified 0 or more times. Cannot be used in conjuction with option(s) ALIGNED_BAM (ALIGNED)
READ2_ALIGNED_BAM=File : SAM or BAM file(s) with alignment data from the second read of a pair. This option may be specified 0 or more times. Cannot be used in conjuction with option(s) ALIGNED_BAM (ALIGNED)
OUTPUT=File : Merged SAM or BAM file to write to. Required.
REFERENCE_SEQUENCE=File : Path to the fasta file for the reference sequence. Required.
PROGRAM_RECORD_ID=String : The program group ID of the aligner (if not supplied by the aligned file). Default value: null.
PROGRAM_GROUP_VERSION=String : The version of the program group (if not supplied by the aligned file). Default value: null.
PROGRAM_GROUP_COMMAND_LINE=String : The command line of the program group (if not supplied by the aligned file). Default value: null.
PROGRAM_GROUP_NAME=String : The name of the program group (if not supplied by the aligned file). Default value: null.
PAIRED_RUN=Boolean : This argument is ignored and will be removed. Required. Possible values: {true, false}
JUMP_SIZE=Integer : The expected jump size (required if this is a jumping library). Deprecated. Use EXPECTED_ORIENTATIONS instead Default value: null. Cannot be used in conjuction with option(s) EXPECTED_ORIENTATIONS (ORIENTATIONS)
CLIP_ADAPTERS=Boolean : Whether to clip adapters where identified. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
IS_BISULFITE_SEQUENCE=Boolean : Whether the lane is bisulfite sequence (used when caculating the NM tag). Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ALIGNED_READS_ONLY=Boolean : Whether to output only aligned reads. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_INSERTIONS_OR_DELETIONS=Integer : The maximum number of insertions or deletions permitted for an alignment to be included. Alignments with more than this many insertions or deletions will be ignored. Set to -1 to allow any number of insertions or deletions. Default value: 1. This option can be set to 'null' to clear the default value.
ATTRIBUTES_TO_RETAIN=String : Reserved alignment attributes (tags starting with X, Y, or Z) that should be brought over from the alignment data when merging. This option may be specified 0 or more times.
ATTRIBUTES_TO_REMOVE=String : Attributes from the alignment record that should be removed when merging. This overrides ATTRIBUTES_TO_RETAIN if they share common tags. This option may be specified 0 or more times.
READ1_TRIM=Integer : The number of bases trimmed from the beginning of read 1 prior to alignment Default value: 0. This option can be set to 'null' to clear the default value.
READ2_TRIM=Integer : The number of bases trimmed from the beginning of read 2 prior to alignment Default value: 0. This option can be set to 'null' to clear the default value.
EXPECTED_ORIENTATIONS=PairOrientation : The expected orientation of proper read pairs. Replaces JUMP_SIZE Possible values: {FR, RF, TANDEM} This option may be specified 0 or more times. Cannot be used in conjuction with option(s) JUMP_SIZE (JUMP)
ALIGNER_PROPER_PAIR_FLAGS=Boolean : Use the aligner's idea of what a proper pair is rather than computing in this program. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
SORT_ORDER=SortOrder : The order in which the merged reads should be output. Default value: coordinate. This option can be set to 'null' to clear the default value. Possible values: {unsorted, queryname, coordinate}
PRIMARY_ALIGNMENT_STRATEGY=PrimaryAlignmentStrategy : Strategy for selecting primary alignment when the aligner has provided more than one alignment for a pair or fragment, and none are marked as primary, more than one is marked as primary, or the primary alignment is filtered out for some reason. BestMapq expects that multiple alignments will be correlated with HI tag, and prefers the pair of alignments with the largest MAPQ, in the absence of a primary selected by the aligner. EarliestFragment prefers the alignment which maps the earliest base in the read. Note that EarliestFragment may not be used for paired reads. BestEndMapq is appropriate for cases in which the aligner is not pair-aware, and does not output the HI tag. It simply picks the alignment for each end with the highest MAPQ, and makes those alignments primary, regardless of whether the two alignments make sense together.MostDistant is also for a non-pair-aware aligner, and picks the alignment pair with the largest insert size. If all alignments would be chimeric, it picks the alignments for each end with the best MAPQ. For all algorithms, ties are resolved arbitrarily. Default value: BestMapq. This option can be set to 'null' to clear the default value. Possible values: {BestMapq, EarliestFragment, BestEndMapq, MostDistant}
CLIP_OVERLAPPING_READS=Boolean : For paired reads, soft clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_SECONDARY_ALIGNMENTS=Boolean : If false, do not write secondary alignments to output. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ADD_MATE_CIGAR=Boolean : Adds the mate CIGAR tag (MC) if true, does not if false. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
This program that adds a BAM tag “GE” onto reads when the read overlaps the exon of a gene. This tag contains the name of the gene, as reported in the annotations file. You can use either a GTF or a RefFlat annotation file with this program, depending on what annotation data source you find most useful. This tag is used later when we will extract digital gene expression (DGE) from the BAM.
mkdir -p Results/dropseqtools/11_TagReadWithGeneExon
Tools/Drop-seq_tools-1.13/TagReadWithGeneExon I=Results/dropseqtools/10_MergeBamAlignment/merged.bam O=Results/dropseqtools/11_TagReadWithGeneExon/star_gene_exon_tagged.bam ANNOTATIONS_FILE=/db/off_biomaj/cellranger/refdata-cellranger-hg19_and_mm10-2.1.0/genes/genes.gtf TAG=GE
#7min
SUMMARY=File : The strand specific summary info Default value: null.
TAG=String : The tag name to use. When there are multiple genes, they will be comma seperated. Default value: GE. This option can be set to 'null' to clear the default value.
STRAND_TAG=String : The strand of the gene(s) the read overlaps. When there are multiple genes, they will be comma seperated. Default value: GS. This option can be set to 'null' to clear the default value.
FUNCTION_TAG=String : The functional annotation for the read Default value: XF. This option can be set to 'null' to clear the default value.
ANNOTATIONS_FILE=File : The annotations set to use to label the read. This can be a GTF or a refFlat file. Required.
USE_STRAND_INFO=Boolean : Use strand info to determine what gene to assign the read to. If this is on, reads can be assigned to a maximum one one gene. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ALLOW_MULTI_GENE_READS=Boolean : Allow a read to span multiple genes. If set to true, the gene name will be set to all of the gene/exons the read spans. In that case, the gene names will be comma separated. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
This program identifies cell barcodes with aberrant “fixed” UMI bases. If only the last UMI base is fixed as a T, the cell barcode is corrected (the last base is trimmed off) and all cell barcodes with identical sequence at the first 15 bases are merged together. If any other UMI base is fixed, the reads with that cell barcode are discarded. The program asks the user to select a number of barcodes on which to perform the correction. We use roughly 2 times the anticipated number cells, as we empirically found that this allows us to recover nearly every defective cell barcode that corresponds to a STAMP (Single-cell Transcriptomes Attached to MicroParticles), rather than an empty bead cell barcode.
All the relationships between UMIs and PCR/sequencing errors are described on the CGAT website
mkdir -p Results/dropseqtools/12_DetectBeadSynthesisErrors
Tools/Drop-seq_tools-1.13/DetectBeadSynthesisErrors I=Results/dropseqtools/11_TagReadWithGeneExon/star_gene_exon_tagged.bam O=Results/dropseqtools/12_DetectBeadSynthesisErrors/star_gene_exon_tagged_cleaned.bam OUTPUT_STATS=Results/dropseqtools/12_DetectBeadSynthesisErrors/star_gene_exon_tagged.synthesis_stats.txt SUMMARY=Results/dropseqtools/12_DetectBeadSynthesisErrors/star_gene_exon_tagged.synthesis_stats.summary.txt NUM_BARCODES=200 PRIMER_SEQUENCE=CTGTCTCTTATA CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE
#5min
PRIMER_SEQUENCE=String : The sequence of the primer. Default value: null.
EDIT_DISTANCE=Integer : When looking at fixed UMIs, see if the edit distance from the UMI to the primer is within this threshold. 0 indicates a perfect match between the primer and the UMI. Default value: 0. This option can be set to 'null' to clear the default value.
CELL_BARCODE_TAG=String : The cell barcode tag. Default value: XC. This option can be set to 'null' to clear the default value.
MOLECULAR_BARCODE_TAG=String : The molecular barcode tag. Default value: XM. This option can be set to 'null' to clear the default value.
GENE_EXON_TAG=String : The Gene/Exon tag Default value: GE. This option can be set to 'null' to clear the default value.
STRAND_TAG=String : The strand of the gene(s) the read overlaps. When there are multiple genes, they will be comma-separated. Default value: GS. This option can be set to 'null' to clear the default value.
READ_MQ=Integer : The map quality of the read to be included when calculating the barcodes in Default value: 10. This option can be set to 'null' to clear the default value.
MIN_UMIS_PER_CELL=Integer : The minimum number of UMIs required to report a cell barcode Default value: 25. This option can be set to 'null' to clear the default value.
NUM_BARCODES=Integer : Find the top set of most common barcodes by HQ reads and only use this set for analysis. Required. Cannot be used in conjuction with option(s) CELL_BC_FILE
CELL_BC_FILE=File : Override NUM_BARCODES, and process reads that have the cell barcodes in this file instead. The file has 1 column with no header. Required. Cannot be used in conjuction with option(s) NUM_BARCODES
MAX_NUM_ERRORS=Integer : Repair Synthesis errors with at most this many missing bases detected. Default value: 1. This option can be set to 'null' to clear the default value.
A key question to answer for your data set is how many cells you want to extract from your BAM. One way to estimate this is to extract the number of reads per cell, then plot the cumulative distribution of reads and select the "knee"" of the distribution. We provide a tool to extract the reads per cell barcode in the Drop-seq software called BAMTagHistogram. This extracts the number of reads for any BAM tag in a BAM file, and is a general purpose tool you can use for a number of purposes. For this purpose, we extract the cell tag "XC".
mkdir -p Results/dropseqtools/13_BAMTagHistogram
Tools/Drop-seq_tools-1.13/BAMTagHistogram I=Results/dropseqtools/12_DetectBeadSynthesisErrors/star_gene_exon_tagged_cleaned.bam O=Results/dropseqtools/13_BAMTagHistogram/hgmm_100_cell_readcounts_cleaned.txt.gz TAG=XC
#1min
TAG=String : Tag to extract Required.
FILTER_PCR_DUPLICATES=Boolean : Filter PCR Duplicates. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
READ_QUALITY=Integer : Read quality filter. Filters all reads lower than this mapping quality. Defaults to 10. Set to 0 to not filter reads by map quality. Default value: 10. This option can be set to 'null' to clear the default value.
hgmm_100_cell_readcounts_cleaned<-read.table("html_files/hgmm_100_cell_readcounts_cleaned.txt", header=F, stringsAsFactors=F)
hgmm_100_cell_readcounts_cleaned_cumsum<-cumsum(hgmm_100_cell_readcounts_cleaned$V1)
hgmm_100_cell_readcounts_cleaned_cumsum<-hgmm_100_cell_readcounts_cleaned_cumsum/max(hgmm_100_cell_readcounts_cleaned_cumsum)
plot(1:length(hgmm_100_cell_readcounts_cleaned_cumsum), hgmm_100_cell_readcounts_cleaned_cumsum, type='l', col="blue", xlab="cell barcodes sorted by number of reads [descending]",
ylab="cumulative fraction of reads", xlim=c(1,500))
mkdir -p Results/dropseqtools/14_DigitalExpression
Tools/Drop-seq_tools-1.13/DigitalExpression I=Results/dropseqtools/12_DetectBeadSynthesisErrors/star_gene_exon_tagged_cleaned.bam O=Results/dropseqtools/14_DigitalExpression/star_gene_exon_tagged_cleaned.dge.txt.gz SUMMARY=Results/dropseqtools/14_DigitalExpression/star_gene_exon_tagged_cleaned.dge.summary.txt NUM_CORE_BARCODES=100
#2min
SUMMARY=File : A summary of the digital expression output, containing 3 columns - the cell barcode, the #genes, and the #transcripts. Default value: null.
OUTPUT_READS_INSTEAD=Boolean : Output number of reads instead of number of unique molecular barcodes. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
OUTPUT=File : Output file of DGE Matrix. Genes are in rows, cells in columns. The first column contains the gene name. This supports zipped formats like gz and bz2. Required.
MIN_SUM_EXPRESSION=Integer : Output only genes with at least this total expression level, after summing across all cells Default value: null.
OUTPUT_HEADER=Boolean : If true, write a header in the DGE file. If not specified, and UEI is specified, it is set to true Default value: null. Possible values: {true, false}
UNIQUE_EXPERIMENT_ID=String : If OUTPUT_HEADER=true, this is required Default value: null.
REFERENCE=File : Reference to which BAM is aligned. This is only used to put into DGE header, if OUTPUT_HEADER=true. If not specified, it is extracted from the INPUT header if possible. Default value: null.
INPUT=File : The input SAM or BAM file to analyze. Required.
CELL_BARCODE_TAG=String : The cell barcode tag. If there are no reads with this tag, the program will assume that all reads belong to the same cell and process in single sample mode. Default value: XC. This option can be set to 'null' to clear the default value.
MOLECULAR_BARCODE_TAG=String : The molecular barcode tag. Default value: XM. This option can be set to 'null' to clear the default value.
GENE_EXON_TAG=String : The Gene/Exon tag Default value: GE. This option can be set to 'null' to clear the default value.
STRAND_TAG=String : The strand of the gene(s) the read overlaps. When there are multiple genes, they will be comma seperated. Default value: GS. This option can be set to 'null' to clear the default value.
EDIT_DISTANCE=Integer : The edit distance that molecular barcodes should be combined at within a gene. Default value: 1. This option can be set to 'null' to clear the default value.
READ_MQ=Integer : The map quality of the read to be included. Default value: 10. This option can be set to 'null' to clear the default value.
MIN_BC_READ_THRESHOLD=Integer : The minimum number of reads a molecular barcode should have to be considered. This is done AFTER edit distance collapse of barcodes. Default value: 0. This option can be set to 'null' to clear the default value.
MIN_NUM_READS_PER_CELL=Integer : Gather up all cell barcodes that have more than some number of reads. Default value: null.
MIN_NUM_GENES_PER_CELL=Integer : The minumum number of genes for a cell barcode to be reported. Default value: null.
MIN_NUM_TRANSCRIPTS_PER_CELL=Integer : The minumum number of transcripts for a cell barcode to be reported. Default value: null.
NUM_CORE_BARCODES=Integer : Number of cells that you think are in the library. This accomplishes the same goals as the MIN_NUM_READS_CORE argument, but instead of defining barcodes as important based on the number of reads, it picks the top barcodes as core. Default value: null.
CELL_BC_FILE=File : Override CELL_BARCODE and MIN_NUM_READS_PER_CELL, and process reads that have the cell barcodes in this file instead. The file has 1 column with no header. Default value: null.
USE_STRAND_INFO=Boolean : Is the library stranded? If so, use the strand info to more precisely place reads on the correct gene, and ignore reads that are on the wrong strand. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
RARE_UMI_FILTER_THRESHOLD=Double : Drop UMIs within a cell/gene pair that occur less than the average number of reads* for all UMIs in the cell/gene pair. For example, if you had on average 1000 reads per UMI and a UMI with 1-10 reads, those small UMIs would be removed when this frequency was set to 0.01.This is off by default. A sensible value might be 0.01. Default value: 0.0. This option can be set to 'null' to clear the default value.
dropseqtools_filtered_hg19_mm10.data<-read.table("html_files/star_gene_exon_tagged_cleaned.dge.txt",header=T,row.names = 1)
dropseqtools_filtered_hg19_mm10.data[1:4,1:4]
## GACTGCGAGGGCATGT GGTGCGTAGGCTACGA ATGAGGGAGTAGTGCG
## hg19_A1BG 6 7 0
## hg19_A1BG-AS1 0 0 0
## hg19_A2M-AS1 0 0 0
## hg19_A4GALT 1 0 0
## CGAACATTCTGATACG
## hg19_A1BG 1
## hg19_A1BG-AS1 1
## hg19_A2M-AS1 0
## hg19_A4GALT 0
dropseqtools_filtered_hg19_mm10<-CreateSeuratObject(raw.data = dropseqtools_filtered_hg19_mm10.data, min.cells = 3, min.genes = 200, project = "sincellte_dropseqtools")
## Error in CreateSeuratObject(raw.data = dropseqtools_filtered_hg19_mm10.data, : could not find function "CreateSeuratObject"
source /usr/local/genome2/conda3/bin/activate sincellte2018
mkdir -p Results/umitools/01_FastQC
fastqc -t 8 -o Results/umitools/01_FastQC Data/*.fastq
#4min
mkdir -p Results/umitools/02_Trim_galore
trim_galore -q 30 --fastqc_args "-t 8" -o Results/umitools/02_Trim_galore --paired --dont_gzip --stringency 5 --length 26 Data/hgmm_100_R1.fastq Data/hgmm_100_R2.fastq
#28min
This program extract cell barcodes and identify the most likely true barcodes using the 'knee' method. In the absence of the --set-cell-number options (see below), we use the distribution of counts per cell barcode to identify the cut-off for 'true' UMIs (the 'knee').
mkdir -p Results/umitools/03_Whitelist
umi_tools whitelist --stdin=Results/umitools/02_Trim_galore/hgmm_100_R1_val_1.fq --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN --plot-prefix=Results/umitools/03_Whitelist/hgmm100_umi_tools_whitelist --stdout=Results/umitools/03_Whitelist/hgmm100_umi_tools_whitelist.txt --log=Results/umitools/03_Whitelist/hgmm100_umi_tools_whitelist.log
#4min
mkdir -p Results/umitools/04_Extract
umi_tools extract --stdin=Results/umitools/02_Trim_galore/hgmm_100_R1_val_1.fq --stdout=Results/umitools/04_Extract/hgmm100_umi_tools_extract --read2-in=Results/umitools/02_Trim_galore/hgmm_100_R2_val_2.fq --read2-out=Results/umitools/04_Extract/hgmm100_umi_tools_extract_R2.fq --whitelist=Results/umitools/03_Whitelist/hgmm100_umi_tools_whitelist.txt --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN --filter-cell-barcode --quality-filter-threshold=30 --quality-encoding=phred33 --error-correct-cell
#15min
mkdir -p Results/umitools/05_STAR
STAR --runThreadN 8 --genomeDir /db/off_biomaj/cellranger/refdata-cellranger-hg19_and_mm10-2.1.0/star --outFileNamePrefix Results/umitools/05_STAR/hgmm100_star_hg19_mm10_ --outSAMunmapped Within --readFilesIn Results/umitools/04_Extract/hgmm100_umi_tools_extract_R2.fq
#13min
mkdir -p Results/umitools/06_MultiQC
cd Results/umitools/06_MultiQC
multiqc -f ../01_FastQC ../02_Trim_galore/hgmm_100_R1_val_1_fastqc.html ../02_Trim_galore/hgmm_100_R1_val_1_fastqc.zip ../02_Trim_galore/hgmm_100_R1_val_1.fq ../02_Trim_galore/hgmm_100_R2_val_2_fastqc.html ../02_Trim_galore/hgmm_100_R2_val_2_fastqc.zip ../02_Trim_galore/hgmm_100_R2_val_2.fq ../05_STAR
#1min
cd ${ROOT_FOLDER}
mkdir -p Results/umitools/07_FeatureCounts
featureCounts -g gene_name -a /db/off_biomaj/cellranger/refdata-cellranger-hg19_and_mm10-2.1.0/genes/genes.gtf -T 8 -o Results/umitools/07_FeatureCounts/hgmm100_gene_assigned.txt -R SAM Results/umitools/05_STAR/hgmm100_star_hg19_mm10_Aligned.out.sam
#11min
mkdir -p Results/umitools/08_Sambamba
/usr/local/genome2/bin/sambamba view -t 8 -S -f bam Results/umitools/07_FeatureCounts/hgmm100_star_hg19_mm10_Aligned.out.sam.featureCounts.sam -o Results/umitools/08_Sambamba/hgmm100_star_hg19_mm10_Aligned.out.sam.featureCounts.bam
/usr/local/genome2/bin/sambamba sort -t 8 -o Results/umitools/08_Sambamba/hgmm100_star_hg19_mm10_Aligned.out.sam.featureCounts.sorted.bam Results/umitools/08_Sambamba/hgmm100_star_hg19_mm10_Aligned.out.sam.featureCounts.bam
#11min
mkdir -p Results/umitools/09_Count
umi_tools count --per-gene --gene-tag=XT --per-cell -I Results/umitools/08_Sambamba/hgmm100_star_hg19_mm10_Aligned.out.sam.featureCounts.sorted.bam -S Results/umitools/09_Count/hgmm100_star_hg19_mm10_Aligned.out.sam.featureCounts.sorted.counts.tsv
#2min
Nathalie LEHMANN and Laurent JOURDREN short presentation
If you have very large datasets, you may consider to replace STAR+FeatureCounts approach with Salmon or kallisto approaches.
Single-cell RNA-seq analysis with kallisto
10X hgmm 1:1 mixture analysis with kallisto
#count_table %>% ggplot(aes(x = log10(1+cellranger), y = log10(1+dropseqtools))) + geom_hex() + facet_wrap(~cell) + theme(strip.text = element_text(size=1))
count_table %>%
filter(cell == "GCAAACTTCCCTCAGT") %>%
ggplot(aes(x = cellranger, y = dropseqtools)) + geom_point() + scale_x_log10() + scale_y_log10()
correlations<-count_table %>%
group_by(cell) %>%
summarise(cor = cor(log10(1+cellranger), log10(1+dropseqtools), use = "pair"))
boxplot(correlations$cor)
#count_table %>% ggplot(aes(x = log10(1+cellranger), y = log10(1+umitools))) + geom_hex() + facet_wrap(~cell) + theme(strip.text = element_text(size=1))
count_table %>%
filter(cell == "GCAAACTTCCCTCAGT") %>%
ggplot(aes(x = cellranger, y = umitools)) + geom_point() + scale_x_log10() + scale_y_log10()
correlations<-count_table %>%
group_by(cell) %>%
summarise(cor = cor(log10(1+cellranger), log10(1+umitools), use = "pair"))
boxplot(correlations$cor)
#count_table %>% ggplot(aes(x = log10(1+umitools), y = log10(1+dropseqtools))) + geom_hex() + facet_wrap(~cell) + theme(strip.text = element_text(size=1))
count_table %>%
filter(cell == "GCAAACTTCCCTCAGT") %>%
ggplot(aes(x = umitools, y = dropseqtools)) + geom_point() + scale_x_log10() + scale_y_log10()
correlations<-count_table %>%
group_by(cell) %>%
summarise(cor = cor(log10(1+umitools), log10(1+dropseqtools), use = "pair"))
boxplot(correlations$cor)
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04 LTS
##
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 Seurat_2.3.0 Matrix_1.2-14 cowplot_0.9.2
## [5] forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4
## [9] readr_1.1.1 tidyr_0.8.0 tibble_1.4.2 ggplot2_2.2.1
## [13] tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.1.0 snow_0.4-2 backports_1.1.2
## [4] Hmisc_4.1-1 VGAM_1.0-5 sn_1.5-2
## [7] plyr_1.8.4 igraph_1.2.1 lazyeval_0.2.1
## [10] splines_3.4.1 digest_0.6.15 foreach_1.4.4
## [13] htmltools_0.3.6 lars_1.2 gdata_2.18.0
## [16] magrittr_1.5 checkmate_1.8.5 cluster_2.0.7-1
## [19] mixtools_1.1.0 ROCR_1.0-7 sfsmisc_1.1-2
## [22] recipes_0.1.2 modelr_0.1.1 gower_0.1.2
## [25] dimRed_0.1.0 R.utils_2.6.0 colorspace_1.3-2
## [28] rvest_0.3.2 haven_1.1.1 crayon_1.3.4
## [31] jsonlite_1.5 bindr_0.1.1 zoo_1.8-1
## [34] survival_2.42-3 iterators_1.0.9 ape_5.1
## [37] glue_1.2.0 DRR_0.0.3 gtable_0.2.0
## [40] ipred_0.9-6 kernlab_0.9-25 ddalpha_1.3.2
## [43] prabclus_2.2-6 DEoptimR_1.0-8 abind_1.4-5
## [46] scales_0.5.0 mvtnorm_1.0-7 Rcpp_0.12.16
## [49] metap_0.9 dtw_1.18-1 htmlTable_1.11.2
## [52] tclust_1.3-1 magic_1.5-8 foreign_0.8-70
## [55] proxy_0.4-22 mclust_5.4 SDMTools_1.1-221
## [58] Formula_1.2-2 tsne_0.1-3 stats4_3.4.1
## [61] lava_1.6.1 prodlim_2018.04.18 htmlwidgets_1.2
## [64] httr_1.3.1 FNN_1.1 gplots_3.0.1
## [67] RColorBrewer_1.1-2 fpc_2.1-11 acepack_1.4.1
## [70] modeltools_0.2-21 ica_1.0-1 pkgconfig_2.0.1
## [73] R.methodsS3_1.7.1 flexmix_2.3-14 nnet_7.3-12
## [76] caret_6.0-79 tidyselect_0.2.4 rlang_0.2.0
## [79] reshape2_1.4.3 munsell_0.4.3 cellranger_1.1.0
## [82] tools_3.4.1 cli_1.0.0 ranger_0.9.0
## [85] ggridges_0.5.0 broom_0.4.4 evaluate_0.10.1
## [88] geometry_0.3-6 yaml_2.1.18 ModelMetrics_1.1.0
## [91] knitr_1.20 fitdistrplus_1.0-9 robustbase_0.93-0
## [94] caTools_1.17.1 RANN_2.5.1 pbapply_1.3-4
## [97] nlme_3.1-137 R.oo_1.22.0 RcppRoll_0.2.2
## [100] xml2_1.2.0 compiler_3.4.1 rstudioapi_0.7
## [103] png_0.1-7 stringi_1.1.7 lattice_0.20-35
## [106] trimcluster_0.1-2 psych_1.8.3.3 diffusionMap_1.1-0
## [109] pillar_1.2.2 lmtest_0.9-36 irlba_2.3.2
## [112] data.table_1.11.4 bitops_1.0-6 R6_2.2.2
## [115] latticeExtra_0.6-28 KernSmooth_2.23-15 gridExtra_2.3
## [118] codetools_0.2-15 MASS_7.3-49 gtools_3.5.0
## [121] assertthat_0.2.0 CVST_0.2-1 rprojroot_1.3-2
## [124] withr_2.1.2 mnormt_1.5-5 diptest_0.75-7
## [127] parallel_3.4.1 doSNOW_1.0.16 hms_0.4.2
## [130] grid_3.4.1 rpart_4.1-13 timeDate_3043.102
## [133] class_7.3-14 rmarkdown_1.9 segmented_0.5-3.0
## [136] Rtsne_0.13 numDeriv_2016.8-1 scatterplot3d_0.3-41
## [139] lubridate_1.7.4 base64enc_0.1-3